We continue using the data set collected in MATH1005 in S2, 2022.
Make sure you have the data-file math1005_cleaned.csv in
your data folder within your STAT5002 folder.
First we read the data into a data frame named math1005. Then we isolate the height variable and remove
NAs from it.
math1005 = read.csv("data/math1005_cleaned.csv")
str(math1005)
## 'data.frame': 281 obs. of 8 variables:
## $ Gender : chr "Female" "Female" "Female" "Male" ...
## $ International: chr "Domestic" "Domestic" "International" "Domestic" ...
## $ Major : chr "Physics, Chemistry" "Biomedical Engineering" "Statistics" "Transport" ...
## $ Height : num 154 182 156 172 193 167 NA 183 150 181 ...
## $ ShoeSize : num 9.5 9 9 8 13 40.5 NA 11 7 44 ...
## $ Age : int 18 22 18 29 18 19 19 18 18 20 ...
## $ Country : chr "Australia" "Australia" "India" "Nepal" ...
## $ Language : chr "English " "English" "Hindi" "Nepali" ...
height = na.omit(math1005$Height)
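As a small illustration of what na.omit does, the sketch below uses a made-up vector (demo is hypothetical and not part of the data set). It shows that the mean of a vector containing NA is itself NA, and that removing the NAs first gives the same result as mean with na.rm = TRUE.

```r
# Hypothetical toy vector with one missing value (not from math1005)
demo = c(154, 182, NA, 172)

cleaned = na.omit(demo)    # drops the NA, keeps the other three values
length(cleaned)

mean(demo)                 # NA, because one value is missing
mean(cleaned)              # mean of the three observed values
mean(demo, na.rm = TRUE)   # same result without creating a new vector
```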
The code curve(dnorm(x, m, s), xlim = c(a, b), add=TRUE)
plots a normal curve with mean m and SD s over the interval \((a,b)\) on the x axis. The option
add=TRUE lets you add the normal curve to an existing plot.
See the following for an example.
### here is an example
curve(dnorm(x, 0, 0.5), xlim = c(-10, 5), col = "red")
curve(dnorm(x, -2, 2), xlim = c(-10, 5), col = "blue", add=TRUE)
The following code sketches a Normal Curve with mean 40 and SD 15. The red shaded area represents \(P(X < 18)\). The blue shaded area represents \(P(60 < X < 80)\).
curve(dnorm(x,40,15),from=-20,to=100,ylab="Density",main="N(40,225)")
x = seq(-3.5,3.5,length=1000)*15 + 40
y = dnorm(x,40,15)
y18 = dnorm(18, 40, 15)
polygon(c(min(x), x[x<18], 18, 18), c(0, y[x<18], y18 , 0), col="red")
y60 = dnorm(60, 40, 15)
y80 = dnorm(80, 40, 15)
polygon(c(60, 60, x[x>60&x<80], 80, 80), c(0, y60, y[x>60&x<80], y80, 0), col="blue")
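The two shaded areas can also be computed numerically. The following sketch is not part of the original worksheet: it uses pnorm for both areas and cross-checks them by applying integrate to the density. The variable names are chosen here for illustration.

```r
# Areas under the N(40, 15^2) curve matching the two shaded regions
p_red  = pnorm(18, 40, 15)                      # P(X < 18)
p_blue = pnorm(80, 40, 15) - pnorm(60, 40, 15)  # P(60 < X < 80)

# Cross-check: integrate the density over the same intervals
area_red  = integrate(dnorm, lower = -Inf, upper = 18, mean = 40, sd = 15)$value
area_blue = integrate(dnorm, lower = 60, upper = 80, mean = 40, sd = 15)$value

c(red = p_red, blue = p_blue)
c(red = area_red, blue = area_blue)
```

The two pairs of numbers should agree up to the numerical error of integrate.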
We treat the collected heights as a sample of heights of students in U Syd. We want to fit a normal curve to the histogram of this height sample using the sample mean and sample SD.
Now, plot the histogram of height and then plot the normal curve (defined by the sample mean and sample SD) on top of it. Is the normal curve a reasonable approximation to the histogram in this example?
### Write your code here
m = mean(height)
s = sd(height)
hist(height, freq=FALSE)  # density scale, so the normal curve is comparable
curve(dnorm(x, m, s), xlim = c(140, 200), add=TRUE)
Answer: Although there are some differences, the normal curve has a reasonable match to the shape of the histogram.
### Write your code here
# Proportions below 160 cm and above 190 cm under the fitted normal curve
pnorm(160, m, s)
## [1] 0.08464811
1 - pnorm(190, m, s)
## [1] 0.04357546
Answer: Yes, the proportion of students under 160 cm (0.085) is larger than that of students taller than 190 cm (0.044).
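The proportions above come from the fitted normal curve, not directly from the data. A minimal sketch of how an observed proportion compares with its normal approximation is given below. Here demo_height is a small made-up sample, used only so that the code runs without the data file.

```r
# Hypothetical sample of heights in cm (not the MATH1005 data)
demo_height = c(152, 158, 163, 165, 168, 170, 171, 173,
                175, 178, 180, 183, 186, 191, 195)

m_demo = mean(demo_height)
s_demo = sd(demo_height)

observed = mean(demo_height < 160)     # proportion of data points below 160
model    = pnorm(160, m_demo, s_demo)  # area below 160 under the fitted curve

c(observed = observed, model = model)
```

With the real data, replacing demo_height by height gives the same comparison for the class sample.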
### Write your code here
# Height below which 23% of students fall under the fitted normal curve
qnorm(0.23, m, s)
## [1] 166.1809
In the Australian Football League (AFL), recruiters tend to look for tall male players. We want to use the heights of male students in MATH1005, S2 2022 as a sample to model the Australian male height.
Select the heights of male students from the data set. Plot a
histogram of the selected heights. Construct a normal curve to
approximate the histogram of male heights in math1005. Plot
the resulting curve on top of the histogram.
### Write your code here
Mselect = !is.na(math1005$Height) & math1005$Gender == "Male"
Mheight = math1005$Height[Mselect]
sum(is.na(Mheight)) # make sure there is no NA left
## [1] 0
length(Mheight) # number of data points
## [1] 167
hist(Mheight, freq=FALSE, main="male heights in MATH1005", ylim=c(0, 0.07))
# fit the normal curve using the sample mean and sample SD of male heights
m = mean(Mheight)
s = sd(Mheight)
curve(dnorm(x,m,s),from=150,to=200, add=TRUE)
m
## [1] 178.8377
s
## [1] 6.533423
For each of the following questions, try to use only
pnorm and qnorm to calculate the answer.
### Write your code here
pnorm(188, m, s, lower.tail = FALSE)  # P(height > 188 cm)
## [1] 0.08040242
### Write your code here
pnorm(211, m, s, lower.tail = FALSE)  # P(height > 211 cm)
## [1] 4.267265e-07
### Write your code here
pnorm(180, m, s)  # P(height < 180 cm)
## [1] 0.570598
pnorm(170, m, s)  # P(height < 170 cm)
## [1] 0.08807664
pnorm(180, m, s) - pnorm(170, m, s)  # P(170 cm < height < 180 cm)
## [1] 0.4825214
### Write your code here
pnorm(177, m, s)  # P(height < 177 cm)
## [1] 0.3892476
### Write your code here
qnorm(0.9, m, s)  # 90th percentile of male height
## [1] 187.2106
### Write your code here
qnorm(0.6, m, s)  # 60th percentile of male height
## [1] 180.4929
### Write your code here
qnorm(0.75, m, s)  # upper quartile
## [1] 183.2445
qnorm(0.25, m, s)  # lower quartile
## [1] 174.431
qnorm(0.75, m, s) - qnorm(0.25, m, s)  # interquartile range
## [1] 8.813454
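Two facts sit behind the qnorm answers above: qnorm is the inverse of pnorm, and the interquartile range of a normal curve is a fixed multiple of its SD. The sketch below checks both. It assumes m and s are set to the rounded values printed earlier for the male heights, so the last digits may differ slightly from the output above.

```r
# Rounded sample mean and SD of male heights, copied from the printed output
m = 178.8377
s = 6.533423

# qnorm() undoes pnorm(): 90% of the curve lies below the 90th percentile
q90 = qnorm(0.9, m, s)
pnorm(q90, m, s)

# The IQR does not depend on m: it equals 2 * qnorm(0.75) * s
iqr_direct  = qnorm(0.75, m, s) - qnorm(0.25, m, s)
iqr_formula = 2 * qnorm(0.75) * s
c(direct = iqr_direct, formula = iqr_formula)
```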
For the above questions 1 to 4, answer the questions again by converting the values to standard units and using the standard normal curve.
### Write your code here
su = (188 - m)/s
pnorm(su, lower.tail = FALSE)
## [1] 0.08040242
### Write your code here
su = (211 - m)/s
pnorm(su, lower.tail = FALSE)
## [1] 4.267265e-07
### Write your code here
su1 = (180 - m)/s
su2 = (170 - m)/s
pnorm(su1) - pnorm(su2)
## [1] 0.4825214
### Write your code here
su = (177 - m)/s
pnorm(su)
## [1] 0.3892476
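The four answers agree with the earlier ones because pnorm(x, m, s) and pnorm((x - m)/s) describe the same area under a normal curve. A short check follows, assuming m and s take the rounded values printed for the male heights (the names x and z are chosen here for illustration):

```r
# Rounded sample mean and SD of male heights, copied from the printed output
m = 178.8377
s = 6.533423

x = c(188, 211, 180, 170, 177)  # thresholds used in the questions above
z = (x - m) / s                 # the same thresholds in standard units

# Areas to the left agree whether computed on the original or standard scale
all.equal(pnorm(x, m, s), pnorm(z))
```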
Use the 68%-95%-99.7% rule to calculate the following by hand.
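The percentages in the rule are rounded areas under the standard normal curve within 1, 2 and 3 SDs of the mean. The following sketch shows where they come from, which may be useful for checking hand calculations. The names k and coverage are chosen here for illustration.

```r
k = 1:3                          # number of SDs either side of the mean
coverage = pnorm(k) - pnorm(-k)  # area between -k and k in standard units
round(coverage, 4)
## [1] 0.6827 0.9545 0.9973
```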
We want to explore the association between shoesize and height. Now
we want to use all data points in math1005.
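Calculate the correlation coefficient between height and shoe size using cor. Hint: the argument use = "complete" (short for "complete.obs") will ignore NA
values.
### Write your code here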
cor(math1005$Height, math1005$ShoeSize, use = "complete")
## [1] 0.01064368
Draw a scatter plot of shoe size against height using the plot
function.
### Write your code here
plot(math1005$Height, math1005$ShoeSize)
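Answer: The correlation coefficient is close to 0, so there is almost no linear association between shoesize and height. This could be caused by outliers in the data set, for example shoe sizes such as 40.5 and 44, which look like EU rather than US sizes.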
Since the majority of students reported US shoesize, let’s discard data points with shoesize > 20 as outliers and then repeat the above procedure. What is your finding?
### Write your code here
Sselect = !is.na(math1005$ShoeSize) & !is.na(math1005$Height) & math1005$ShoeSize < 20
height = math1005$Height[Sselect]
shoe = math1005$ShoeSize[Sselect]
#
cor(height, shoe)
## [1] 0.7757101
plot(height, shoe)
Answer: Now there is a strong positive association between shoesize and height.
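To see how a few values recorded on a different scale can mask an association, here is a sketch on made-up data. All the demo_ variables are hypothetical: heights and US shoe sizes are constructed to be strongly related, and then two of the shoe sizes are converted with EU = US x 1.27 + 30.

```r
# Hypothetical data: 10 heights (cm) and strongly related US shoe sizes
demo_height  = seq(150, 195, by = 5)
demo_shoe_us = (demo_height - 100) / 10 + rep(c(0.5, -0.5), 5)

# Two students report EU sizes instead of US sizes
demo_shoe_mixed = demo_shoe_us
eu = c(2, 9)
demo_shoe_mixed[eu] = demo_shoe_us[eu] * 1.27 + 30

cor_clean = cor(demo_height, demo_shoe_us)     # strong positive
cor_mixed = cor(demo_height, demo_shoe_mixed)  # much weaker

# Dropping shoe sizes of 20 or more restores the association
keep = demo_shoe_mixed < 20
cor_kept = cor(demo_height[keep], demo_shoe_mixed[keep])

c(clean = cor_clean, mixed = cor_mixed, kept = cor_kept)
```

Only two mis-scaled points out of ten are enough to pull the correlation close to zero here, which mirrors the pattern seen in the class data.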
Using the data without outliers, verify the following properties of the correlation coefficient using R.
### Write your code here
cor(shoe, height)  # the order of the two variables does not matter
## [1] 0.7757101
cor(height, shoe)
## [1] 0.7757101
Take the conversion between shoe sizes to be \(\text{EU ShoeSize} = \text{US ShoeSize} \times 1.27 + 30\). Now transform the cleaned shoesizes (assuming they are US sizes)
into EU sizes. Then verify that the correlation coefficient is shift and
scale invariant.
### Write your code here
EUshoe = shoe * 1.27 + 30
cor(shoe, height)
## [1] 0.7757101
cor(EUshoe, height)
## [1] 0.7757101
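More generally, adding a constant or multiplying by a positive constant leaves the correlation coefficient unchanged, while multiplying by a negative constant flips its sign. A minimal sketch on made-up values follows (demo_shoe and demo_height below are hypothetical, not taken from the data set):

```r
# Hypothetical paired measurements (not the MATH1005 data)
demo_shoe   = c(7, 8, 8.5, 9, 10, 11, 12)
demo_height = c(160, 168, 165, 172, 178, 181, 190)

r = cor(demo_shoe, demo_height)

all.equal(cor(demo_shoe + 30, demo_height), r)           # shift only
all.equal(cor(1.27 * demo_shoe, demo_height), r)         # positive scale only
all.equal(cor(1.27 * demo_shoe + 30, demo_height), r)    # shift and scale
all.equal(cor(-1.27 * demo_shoe + 30, demo_height), -r)  # negative scale flips the sign
```

Each comparison should return TRUE.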